
Extracting data from an HTML file

Asked by rius. Last reply on 5 Sept. 2017 at 21:08 by Flachy Joe
Hello,

I have a file containing HTML code and I would like to extract data from it.
The code is not formatted at all, but I have no other option :(

It was copied with the browser's "Inspect" tool:

<tbody><tr style="height: auto;"><th style="height: 0px; width: 80px;"></th><th style="height: 0px; width: 50px;"></th><th style="height: 0px; width: 60px;"></th><th style="height: 0px; width: 60px;"></th><th style="height: 0px; width: 100px;"></th><th style="height: 0px; width: 80px;"></th><th style="height: 0px; width: 80px;"></th><th style="height: 0px; width: 80px;"></th><th style="height: 0px; width: 90px;"></th></tr><tr class=" ev_dhx_web"><td align="left" valign="middle" class="                      " title="VINT2017_1">VINT2017_1</td><td align="center" valign="middle" class="                      " title="DC6">DC6</td><td align="left" valign="middle" class="                      " title="SBEG">SBEG</td><td align="left" valign="middle" class="                      ">SBTF</td><td align="left" valign="middle" class="                      " title="FLC01">FLC01</td><td align="right" valign="middle" class="                      " title="01h 18m">01h 18m</td><td align="right" valign="middle" class="                      " title="281">281</td><td align="right" valign="middle" class="                      " title="95.00">95.00</td><td align="right" valign="middle" class="                      " title="Aug 19 2017">Aug 19 2017</td></tr><tr class=" odd_dhx_web"><td align="left" valign="middle" class="                 " title="VINT2017_7">VINT2017_7</td><td align="center" valign="middle" class="                 " title="DC6">DC6</td><td align="left" valign="middle" class="                 " title="SPJC">SPJC</td><td align="left" valign="middle" class="                 " title="SPZO">SPZO</td><td align="left" valign="middle" class="                 " title="FLC01">FLC01</td><td align="right" valign="middle" class="                 " title="01h 18m">01h 18m</td><td align="right" valign="middle" class="                 " title="316">316</td><td align="right" valign="middle" class="                 " title="101.50">101.50</td><td align="right" valign="middle" class="       
          " title="Aug 23 2017">Aug 23 2017</td></tr><tr class=" ev_dhx_web"><td align="left" valign="middle" class="      " title="VINT2017_6">VINT2017_6</td><td align="center" valign="middle" class="      ">DC6</td><td align="left" valign="middle" class="      " title="SPHI">SPHI</td><td align="left" valign="middle" class="      ">SPJC</td><td align="left" valign="middle" class="      ">FLC01</td><td align="right" valign="middle" class="      " title="01h 24m">01h 24m</td><td align="right" valign="middle" class="      " title="353">353</td><td align="right" valign="middle" class="      " title="96.25">96.25</td><td align="right" valign="middle" class="      " title="Aug 20 2017">Aug 20 2017</td></tr><tr class=" odd_dhx_web"><td align="left" valign="middle" class="    " title="VINT2017_5">VINT2017_5</td><td align="center" valign="middle" class="    " title="DC6">DC6</td><td align="left" valign="middle" class="    ">SECU</td><td align="left" valign="middle" class="    ">SPHI</td><td align="left" valign="middle" class="    ">FLC01</td><td align="right" valign="middle" class="    " title="01h 06m">01h 06m</td><td align="right" valign="middle" class="    ">239</td><td align="right" valign="middle" class="    ">103.50</td><td align="right" valign="middle" class="    ">Aug 20 2017</td></tr><tr class=" ev_dhx_web"><td align="left" valign="middle" class="       ">VINT2017_9</td><td align="center" valign="middle" class="       ">DC6</td><td align="left" valign="middle" class="       ">SLLP</td><td align="left" valign="middle" class="       " title="SLET">SLET</td><td align="left" valign="middle" class="       " title="FLC01">FLC01</td><td align="right" valign="middle" class="       " title="01h 18m">01h 18m</td><td align="right" valign="middle" class="       ">297</td><td align="right" valign="middle" class="       ">-1.75</td><td align="right" valign="middle" class="       ">Aug 24 2017</td></tr><tr class=" odd_dhx_web"><td align="left" valign="middle" class=" 
">VINT2017_10</td><td align="center" valign="middle" class=" ">DC6</td><td align="left" valign="middle" class=" " title="SLET">SLET</td><td align="left" valign="middle" class=" ">SBCY</td><td align="left" valign="middle" class=" ">FLC01</td><td align="right" valign="middle" class=" ">01h 48m</td><td align="right" valign="middle" class=" ">426</td><td align="right" valign="middle" class=" ">105.50</td><td align="right" valign="middle" class=" ">Aug 25 2017</td></tr><tr class=" ev_dhx_web"><td align="left" valign="middle">VINT2017_4</td><td align="center" valign="middle">DC6</td><td align="left" valign="middle">SPQT</td><td align="left" valign="middle">SECU</td><td align="left" valign="middle">FLC01</td><td align="right" valign="middle">01h 30m</td><td align="right" valign="middle">345</td><td align="right" valign="middle">101.50</td><td align="right" valign="middle">Aug 20 2017</td></tr><tr class=" odd_dhx_web"><td align="left" valign="middle" class=" " title="VINT2017_3">VINT2017_3</td><td align="center" valign="middle" class=" ">DC6</td><td align="left" valign="middle" class=" ">SBTT</td><td align="left" valign="middle" class=" ">SPQT</td><td align="left" valign="middle" class=" ">FLC01</td><td align="right" valign="middle" class=" ">01h 06m</td><td align="right" valign="middle" class=" ">203</td><td align="right" valign="middle" class=" ">100.00</td><td align="right" valign="middle" class=" ">Aug 20 2017</td></tr><tr class=" ev_dhx_web"><td align="left" valign="middle">TAP1001</td><td align="center" valign="middle">738</td><td align="left" valign="middle">LPPT</td><td align="left" valign="middle">LPMA</td><td align="left" valign="middle">FLC01</td><td align="right" valign="middle">01h 36m</td><td align="right" valign="middle">521</td><td align="right" valign="middle">101.50</td><td align="right" valign="middle">Aug 30 2017</td></tr><tr class=" odd_dhx_web"><td align="left" valign="middle">VINT2017_2</td><td align="center" valign="middle">DC6</td><td align="left" 
valign="middle">SBTF</td><td align="left" valign="middle">SBTT</td><td align="left" valign="middle">FLC01</td><td align="right" valign="middle">01h 24m</td><td align="right" valign="middle">316</td><td align="right" valign="middle">87.75</td><td align="right" valign="middle">Aug 20 2017</td></tr><tr class=" ev_dhx_web"><td align="left" valign="middle" class=" ">VINT2017_8</td><td align="center" valign="middle" class=" ">DC6</td><td align="left" valign="middle" class=" ">SPZO</td><td align="left" valign="middle" class=" ">SLLP</td><td align="left" valign="middle" class=" ">FLC01</td><td align="right" valign="middle" class=" ">01h 12m</td><td align="right" valign="middle" class=" ">282</td><td align="right" valign="middle" class=" ">101.50</td><td align="right" valign="middle" class=" " title="Aug 23 2017">Aug 23 2017</td></tr><tr class=" odd_dhx_web"><td align="left" valign="middle" class=" ">VINT2017_10</td><td align="center" valign="middle" class=" ">DC6</td><td align="left" valign="middle" class=" ">SLET</td><td align="left" valign="middle" class=" ">SBCY</td><td align="left" valign="middle" class=" ">FLC02</td><td align="right" valign="middle" class=" ">01h 48m</td><td align="right" valign="middle" class=" ">426</td><td align="right" valign="middle" class=" ">98.25</td><td align="right" valign="middle" class=" " title="Aug 25 2017">Aug 25 2017</td></tr><tr class=" ev_dhx_web"><td align="left" valign="middle" class=" ">VINT2017_9</td><td align="center" valign="middle" class=" ">DC6</td><td align="left" valign="middle" class=" ">SLLP</td><td align="left" valign="middle" class=" ">SLET</td><td align="left" valign="middle" class=" ">FLC02</td><td align="right" valign="middle" class=" ">01h 18m</td><td align="right" valign="middle" class=" ">297</td><td align="right" valign="middle" class=" ">63.50</td><td align="right" valign="middle" class=" " title="Aug 24 2017">Aug 24 2017</td></tr><tr class=" odd_dhx_web"><td align="left" valign="middle" class=" 
">VINT2017_8</td><td align="center" valign="middle" class=" ">DC6</td><td align="left" valign="middle" class=" ">SPZO</td><td align="left" valign="middle" class=" ">SLLP</td><td align="left" valign="middle" class=" ">FLC02</td><td align="right" valign="middle" class=" ">01h 18m</td><td align="right" valign="middle" class=" ">282</td><td align="right" valign="middle" class=" ">100.25</td><td align="right" valign="middle" class=" " title="Aug 23 2017">Aug 23 2017</td></tr><tr class=" ev_dhx_web"><td align="left" valign="middle" class=" ">TAP1001</td><td align="center" valign="middle" class=" ">738</td><td align="left" valign="middle" class=" ">LPPT</td><td align="left" valign="middle" class=" ">LPMA</td><td align="left" valign="middle" class=" ">FLC02</td><td align="right" valign="middle" class=" ">01h 48m</td><td align="right" valign="middle" class=" ">521</td><td align="right" valign="middle" class=" ">103.50</td><td align="right" valign="middle" class=" " title="Aug 30 2017">Aug 30 2017</td></tr><tr class=" odd_dhx_web"><td align="left" valign="middle" class="   ">VINT2017_7</td><td align="center" valign="middle" class="   ">DC6</td><td align="left" valign="middle" class="   ">SPJC</td><td align="left" valign="middle" class="   ">SPZO</td><td align="left" valign="middle" class="   ">FLC02</td><td align="right" valign="middle" class="   " title="02h 00m">02h 00m</td><td align="right" valign="middle" class="   ">316</td><td align="right" valign="middle" class="   ">100.00</td><td align="right" valign="middle" class="   " title="Aug 23 2017">Aug 23 2017</td></tr><tr class=" ev_dhx_web"><td align="left" valign="middle">VINT2017_6</td><td align="center" valign="middle">DC6</td><td align="left" valign="middle">SPHI</td><td align="left" valign="middle">SPJC</td><td align="left" valign="middle">FLC02</td><td align="right" valign="middle">01h 24m</td><td align="right" valign="middle">353</td><td align="right" valign="middle">100.00</td><td align="right" valign="middle">Aug 
20 2017</td></tr><tr class=" odd_dhx_web"><td align="left" valign="middle">VINT2017_5</td><td align="center" valign="middle">DC6</td><td align="left" valign="middle">SECU</td><td align="left" valign="middle">SPHI</td><td align="left" valign="middle">FLC02</td><td align="right" valign="middle">01h 00m</td><td align="right" valign="middle">239</td><td align="right" valign="middle">88.75</td><td align="right" valign="middle">Aug 20 2017</td></tr><tr class=" ev_dhx_web"><td align="left" valign="middle">VINT2017_4</td><td align="center" valign="middle">DC6</td><td align="left" valign="middle">SPQT</td><td align="left" valign="middle">SECU</td><td align="left" valign="middle">FLC02</td><td align="right" valign="middle">01h 30m</td><td align="right" valign="middle">345</td><td align="right" valign="middle">100.00</td><td align="right" valign="middle">Aug 20 2017</td></tr><tr class=" odd_dhx_web"><td align="left" valign="middle">VINT2017_3</td><td align="center" valign="middle">DC6</td><td align="left" valign="middle">SBTT</td><td align="left" valign="middle">SPQT</td><td align="left" valign="middle">FLC02</td><td align="right" valign="middle">01h 06m</td><td align="right" valign="middle">203</td><td align="right" valign="middle">105.50</td><td align="right" valign="middle">Aug 20 2017</td></tr><tr class=" ev_dhx_web"><td align="left" valign="middle">VINT2017_2</td><td align="center" valign="middle">DC6</td><td align="left" valign="middle">SBTF</td><td align="left" valign="middle">SBTT</td><td align="left" valign="middle">FLC02</td><td align="right" valign="middle">01h 24m</td><td align="right" valign="middle">316</td><td align="right" valign="middle">91.50</td><td align="right" valign="middle">Aug 20 2017</td></tr><tr class=" odd_dhx_web"><td align="left" valign="middle">VINT2017_1</td><td align="center" valign="middle">DC6</td><td align="left" valign="middle">SBEG</td><td align="left" valign="middle">SBTF</td><td align="left" valign="middle">FLC02</td><td align="right" 
valign="middle">01h 18m</td><td align="right" valign="middle">281</td><td align="right" valign="middle">93.50</td><td align="right" valign="middle">Aug 19 2017</td></tr><tr class=" ev_dhx_web"><td align="left" valign="middle">TAP1002</td><td align="center" valign="middle">738</td><td align="left" valign="middle">LEMG</td><td align="left" valign="middle">LEMH</td><td align="left" valign="middle">FLC02</td><td align="right" valign="middle">01h 18m</td><td align="right" valign="middle">454</td><td align="right" valign="middle">102.25</td><td align="right" valign="middle">Aug 31 2017</td></tr></tbody>


In the code there are several strings that start with FLC (for example FLC01, FLC02), and a little further on, in quotes, there are durations in hours and minutes.

What I would like is, for each FLC code, the sum of the hours.

Thanks for your help
Hi,
As a first step, we can extract the interesting data:
flo@bidul:~/Test$ cat data.htm
<tbody><tr style="height: auto;"><th style="height: 0px; width: 80px;"></th><th style="height: 0px; width: 50px;"></th><th style="height: 0px; width: 60px;"></th><th style="height: 0px; width: 60px;"></th><th style="height: 0px; width: 100px;"></th><th style="height: 0px; width: 80px;"></th><th style="height: 0px; width: 80px;"></th><th style="height: 0px; width: 80px;"></th><th style="height: 0px; width: 90px;"></th></tr><tr class=" ev_dhx_web"><td align="left" valign="middle" class=" " title="VINT2017_1">VINT2017_1</td><td align="center" valign="middle" class=" " title="DC6">
[...]
<td align="left" valign="middle">TAP1002</td><td align="center" valign="middle">738</td><td align="left" valign="middle">LEMG</td><td align="left" valign="middle">LEMH</td><td align="left" valign="middle">FLC02</td><td align="right" valign="middle">01h 18m</td><td align="right" valign="middle">454</td><td align="right" valign="middle">102.25</td><td align="right" valign="middle">Aug 31 2017</td></tr></tbody>

flo@bidul:~/Test$ sed "s/tr><tr/>\n</g;s/<[^>]*>/,/g;s/,,/,/g" < data.htm | cut -d, -f6-7
,
FLC01,01h 18m
FLC01,01h 18m
FLC01,01h 24m
FLC01,01h 06m
FLC01,01h 18m
FLC01,01h 48m
FLC01,01h 30m
FLC01,01h 06m
FLC01,01h 36m
FLC01,01h 24m
FLC01,01h 12m
FLC02,01h 48m
FLC02,01h 18m
FLC02,01h 18m
FLC02,01h 48m
FLC02,02h 00m
FLC02,01h 24m
FLC02,01h 00m
FLC02,01h 30m
FLC02,01h 06m
FLC02,01h 24m
FLC02,01h 18m
FLC02,01h 18m
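
To make the three sed expressions easier to follow, here is the same pipeline wrapped in a commented function and run on a shortened, made-up sample (a header row plus two data rows cut down to seven cells). The function name `extract_fields` and the sample are illustrative only. The sketch assumes GNU sed, since `\n` in the replacement part of `s` is, as far as I know, a GNU extension.

```shell
# Hypothetical sample in the same shape as data.htm, all on one line.
sample='<tbody><tr><th></th><th></th></tr><tr class=" ev_dhx_web"><td>VINT2017_1</td><td>DC6</td><td>SBEG</td><td>SBTF</td><td>FLC01</td><td>01h 18m</td><td>281</td></tr><tr class=" odd_dhx_web"><td>VINT2017_10</td><td>DC6</td><td>SLET</td><td>SBCY</td><td>FLC02</td><td>01h 48m</td><td>426</td></tr></tbody>'

extract_fields() {
  # s/tr><tr/>\n</g : cut the long line at each "</tr><tr" boundary, one table row per line
  # s/<[^>]*>/,/g   : replace every tag (attributes included) with a comma
  # s/,,/,/g        : merge the comma pairs left by "</td><td>" into a single separator
  # cut -d, -f6-7   : field 1 is empty, 2-5 are flight/aircraft/airports, 6 is FLCxx, 7 the duration
  sed 's/tr><tr/>\n</g;s/<[^>]*>/,/g;s/,,/,/g' | cut -d, -f6-7
}

printf '%s\n' "$sample" | extract_fields
```

The last two lines printed are `FLC01,01h 18m` and `FLC02,01h 48m`. The line before them comes from the header row, which has no data in fields 6 and 7.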


I'm thinking about the next step...
;-) Flachy Joe ;-)
"Those who never slip up have no chance of growing!" Anonymous graffiti
Flachy Joe - 3 Sept. 2017 at 17:36
With a bit of awk, it gives this:
sed "s/tr><tr/>\n</g;s/<[^>]*>/,/g;s/,,/,/g;s/h /,/g" < data.htm | cut -d, -f6-8 | awk -F "," '{ t[$1]+=$2; t1[$1]+=$3} END {for(n in t)printf "%s %ih%imn\n", n, t[n]+int(t1[n]/60), t1[n]%60 }'
0h0mn
FLC01 15h0mn
FLC02 17h12mn
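
The awk stage keeps two arrays indexed by the FLC code, one for hours and one for minutes, and only carries the surplus minutes over into hours when printing. Below is a commented sketch of that stage alone. The function name `sum_by_code`, the `$1 != ""` filter (which drops the empty header line behind the stray `0h0mn` row) and the final `sort` are additions for illustration, not part of the original command.

```shell
sum_by_code() {
  # Expects lines such as "FLC01,01,18m" (code, hours, minutes) on standard input.
  awk -F, '
    $1 != "" {          # ignore the empty line produced by the header row
      h[$1] += $2       # "01" is read as 1
      m[$1] += $3       # "18m" is read as 18, awk keeps the leading digits
    }
    END {
      for (c in h)      # iteration order is unspecified, hence the sort below
        printf "%s %dh%dmn\n", c, h[c] + int(m[c] / 60), m[c] % 60
    }
  ' | sort
}

# Three FLC01 durations taken from the listing above, after s/h /,/g
printf '%s\n' ',,' 'FLC01,01,18m' 'FLC01,01,24m' 'FLC01,01,48m' | sum_by_code
```

This prints `FLC01 4h30mn`: 3 hours plus 90 minutes, of which 60 are carried over as one extra hour.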


There is probably a way to do it all in awk, but I'm no expert...
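
For what it's worth, here is one possible awk-only variant, offered as a sketch rather than a tested replacement for the pipeline above. It assumes every data row contains exactly one `FLCnn` code and one duration written as `NNh NNm`. The function name `sum_flc_hours` is made up for the example. It locates both values by pattern instead of by column position, and it totals everything in minutes before formatting.

```shell
sum_flc_hours() {
  awk '
    { html = html " " $0 }                       # glue all input lines into one string
    END {
      n = split(html, rows, "</tr>")             # one array element per table row
      for (i = 1; i <= n; i++) {
        row = rows[i]
        gsub(/<[^>]*>/, ",", row)                # drop the tags and the title="..." copies
        if (!match(row, /FLC[0-9]+/)) continue   # header row or trailing </tbody>
        code = substr(row, RSTART, RLENGTH)
        if (!match(row, /[0-9]+h [0-9]+m/)) continue
        split(substr(row, RSTART, RLENGTH), hm, "h ")
        minutes[code] += hm[1] * 60 + hm[2]      # "18m" counts as 18
      }
      for (code in minutes)
        printf "%s %dh%dmn\n", code, int(minutes[code] / 60), minutes[code] % 60
    }
  ' "$@" | sort
}

# On the real file the call would be: sum_flc_hours data.htm
# Small made-up input to try it out:
printf '%s\n' '<tbody><tr><th></th></tr><tr><td>X1</td><td>FLC01</td><td>01h 18m</td></tr><tr><td>X2</td><td>FLC01</td><td>01h 48m</td></tr></tbody>' | sum_flc_hours
```

On the made-up input this prints `FLC01 3h6mn` (78 + 108 = 186 minutes). Joining the input lines first means it should also cope with a file where the HTML has been wrapped across several lines.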
rius - 5 Sept. 2017 at 11:31
Thanks a lot, I'll go and set this up right away.
Flachy Joe - 5 Sept. 2017 at 21:08
If you need any explanation of the various commands, don't hesitate to ask.