Unable to correctly scrape a badly formatted table

rcarvalheira · April 5, 2022, 5:50pm

I'm trying to scrape a table in which one of the group values sits in a separate row.

Basically, I want to generate a table with the following columns:

1 - name
2 - type
3 - company id
4 - company name
5 - value

I'm able to generate all of them except number 2, "type".
Since it does not share a parent element with the company rows, I'm unable to scrape it correctly.
I can only add it as a multiple-value selector, which cross-references every type with all "company" lines instead of only the ones beneath it.

Url: https://sisgvarmazenamento.blob.core.windows.net/prd/PublicacaoPortal/Arquivos/201901.htm

Here is the HTML structure:

Sitemap:
{"_id":"camara_municipal_sp_gastos","startUrl":["https://sisgvarmazenamento.blob.core.windows.net/prd/PublicacaoPortal/Arquivos/2019[01-12].htm","https://sisgvarmazenamento.blob.core.windows.net/prd/PublicacaoPortal/Arquivos/2020[01-12].htm","https://sisgvarmazenamento.blob.core.windows.net/prd/PublicacaoPortal/Arquivos/2021[01-12].htm","https://sisgvarmazenamento.blob.core.windows.net/prd/PublicacaoPortal/Arquivos/2022[01-03].htm"],"selectors":[{"id":"grup_vereador","parentSelectors":["_root"],"type":"SelectorElement","selector":"table[class='bloco']","multiple":true,"delay":0},{"id":"vereador","parentSelectors":["grup_vereador"],"type":"SelectorText","selector":"b","multiple":false,"delay":0,"regex":"\\s(.*)"},{"id":"fornecedor_cnpj","parentSelectors":["tr_table"],"type":"SelectorText","selector":"td[width='15%']","multiple":false,"delay":0,"regex":""},{"id":"fornecedor_nome","parentSelectors":["tr_table"],"type":"SelectorText","selector":"td[width='60%']","multiple":false,"delay":0,"regex":""},{"id":"fornecedor_valor_utilizado","parentSelectors":["tr_table"],"type":"SelectorText","selector":"td[width='20%']","multiple":false,"delay":0,"regex":""},{"id":"tr_table","parentSelectors":["grup_vereador"],"type":"SelectorElement","selector":"tr:has(> td:nth-child(2)[width=\"60%\"])","multiple":true,"delay":0},{"id":"competencia","parentSelectors":["_root"],"type":"SelectorText","selector":"h3","multiple":false,"delay":0,"regex":"\\d\\d/\\d\\d\\d\\d"}]}

ViestursWS · April 6, 2022, 5:39am

@rcarvalheira Hi, it appears that the most viable way to extract the desired data points will require using the 'Grouped' selector. If necessary you can apply additional data post-processing in order to divide the scraped results by a new line using the parser feature within Web Scraper Cloud.

Example:

{"_id":"sisgvarmazenamento-blob-core-windows-net","startUrl":["https://sisgvarmazenamento.blob.core.windows.net/prd/PublicacaoPortal/Arquivos/201901.htm"],"selectors":[{"delay":0,"id":"wrapper","multiple":true,"parentSelectors":["_root"],"selector":"div:nth-of-type(2) .bloco > tbody","type":"SelectorElement"},{"delay":0,"id":"name","multiple":false,"parentSelectors":["wrapper"],"regex":"","selector":"b","type":"SelectorText"},{"delay":0,"extractAttribute":"","id":"type","parentSelectors":["wrapper"],"selector":"td[colspan='3']","type":"SelectorGroup"},{"delay":0,"extractAttribute":"","id":"company-id","parentSelectors":["wrapper"],"selector":"> tr > td > table[border='1'] td[width='15%']","type":"SelectorGroup"},{"delay":0,"extractAttribute":"","id":"company-name","parentSelectors":["wrapper"],"selector":"td[width='60%']","type":"SelectorGroup"},{"delay":0,"extractAttribute":"","id":"value","parentSelectors":["wrapper"],"selector":"> tr > td > table[border='1'] td[width='20%']","type":"SelectorGroup"}]}

rcarvalheira · April 6, 2022, 11:37am

Thanks for the example.
I have never used the Grouped selector before.
Still, the problem I see is a mismatch, since there may be multiple companies per type.

I did a rough parse with Google Sheets, and this was the result.
There are some mismatches, and toward the end there are companies without a type. The Cloud parser would probably behave similarly, right? (I don't have the Cloud service yet.)

Below, the highlighted lines should all have the same type.
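
For reference, one way to avoid this kind of mismatch outside of Web Scraper is to walk the cells in document order and carry the most recent "type" forward onto every company row beneath it. The Python sketch below is only an illustration of that idea, not a Web Scraper feature. The SAMPLE markup is a simplified guess at the page layout, based only on the selectors used in this thread (td[colspan='3'] for the type, td[width='15%'], td[width='60%'] and td[width='20%'] for the company cells); the names TypeFillParser, extract_rows and the output keys are made up for the example.

```python
from html.parser import HTMLParser

# Assumed mapping from the width attribute to an output column,
# taken from the selectors used in the sitemaps above.
WIDTH_TO_FIELD = {"15%": "company_id", "60%": "company_name", "20%": "value"}

# Hypothetical, simplified markup; the real page may differ.
SAMPLE = """
<table class="bloco">
  <tr><td colspan="3">TYPE A</td></tr>
  <tr><td width="15%">111</td><td width="60%">Company One</td><td width="20%">10,00</td></tr>
  <tr><td width="15%">222</td><td width="60%">Company Two</td><td width="20%">20,00</td></tr>
  <tr><td colspan="3">TYPE B</td></tr>
  <tr><td width="15%">333</td><td width="60%">Company Three</td><td width="20%">30,00</td></tr>
</table>
"""


class TypeFillParser(HTMLParser):
    """Collects company rows and attaches the last seen type to each one."""

    def __init__(self):
        super().__init__()
        self.rows = []            # finished output rows
        self.current_type = None  # most recent type header seen so far
        self._field = None        # column the open <td> belongs to, if any
        self._buf = []            # text fragments of the open <td>
        self._row = {}            # company row being assembled

    def handle_starttag(self, tag, attrs):
        if tag != "td":
            return
        attr = dict(attrs)
        if attr.get("colspan") == "3":
            self._field = "type"
        else:
            # None for cells that are neither a type nor a company cell
            self._field = WIDTH_TO_FIELD.get(attr.get("width"))
        self._buf = []

    def handle_data(self, data):
        if self._field is not None:
            self._buf.append(data)

    def handle_endtag(self, tag):
        if tag != "td" or self._field is None:
            return
        text = "".join(self._buf).strip()
        if self._field == "type":
            # A new type header applies to all following company rows
            self.current_type = text
        else:
            self._row[self._field] = text
            if self._field == "value":
                # The value cell is assumed to be the last cell of a company row
                self.rows.append({"type": self.current_type, **self._row})
                self._row = {}
        self._field = None


def extract_rows(html):
    parser = TypeFillParser()
    parser.feed(html)
    return parser.rows


rows = extract_rows(SAMPLE)
for row in rows:
    print(row)
```

Because the type is attached while reading the rows in order, a type with several companies does not shift the rows that follow it, which is where a position-based pairing of the grouped columns tends to go wrong. The "name" column could be carried forward in the same way.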