VBA to get the href value

I am writing a macro to extract the href value from a website; the example here is to extract the value ‘/listedco/listconews/SEHK/2015/0429/LTN201504291355_C.pdf’ from the HTML code below. The href is one of the attributes of the HTML tag ‘a’. I added getElementsByTagName("a"), but it did not work. My question is how to extract that href value to column L. Could anyone help? Thanks in advance!

<a id="ctl00_gvMain_ctl03_hlTitle" class="news" href="/listedco/listconews/SEHK/2015/0429/LTN201504291355_C.pdf" target="_blank">二零一四年年報</a>

Sub Download_From_HKEX()
    Dim internetdata As Object
    Dim div_result As Object
    Dim header_links As Object
    Dim link As Object
    Dim URL As String
    Dim IE As Object
    Dim i As Object
    Dim ieDoc As Object
    Dim selectItems As Variant
    Dim h As Variant

    Dim LocalFileName As String
    Dim B As Boolean
    Dim ErrorText As String
    Dim x As Variant

    'Key Ratios
    For x = 1 To 1579
        Set IE = New InternetExplorerMedium
        IE.Visible = True
        URL = "http://www.hkexnews.hk/listedco/listconews/advancedsearch/search_active_main_c.aspx"
        IE.navigate URL
        Do
            DoEvents
        Loop Until IE.readyState = 4
        Application.Wait (Now + TimeValue("0:00:05"))
        Call IE.Document.getElementById("ctl00_txt_stock_code").setAttribute("value", Worksheets("Stocks").Cells(x, 1).Value)

        Set selectItems = IE.Document.getElementsByName("ctl00$sel_tier_1")
        For Each i In selectItems
            i.Value = "4"
            i.FireEvent ("onchange")
        Next i

        Set selectItems = IE.Document.getElementsByName("ctl00$sel_tier_2")
        For Each i In selectItems
            i.Value = "159"
            i.FireEvent ("onchange")
        Next i

        Set selectItems = IE.Document.getElementsByName("ctl00$sel_DateOfReleaseFrom_d")
        For Each i In selectItems
            i.Value = "01"
            i.FireEvent ("onchange")
        Next i

        Set selectItems = IE.Document.getElementsByName("ctl00$sel_DateOfReleaseFrom_m")
        For Each i In selectItems
            i.Value = "04"
            i.FireEvent ("onchange")
        Next i

        Set selectItems = IE.Document.getElementsByName("ctl00$sel_DateOfReleaseFrom_y")
        For Each i In selectItems
            i.Value = "1999"
            i.FireEvent ("onchange")
        Next i

        Application.Wait (Now + TimeValue("0:00:02"))
        Set ieDoc = IE.Document
        With ieDoc.forms(0)
            Call IE.Document.parentWindow.execScript("document.forms[0].submit()", "JavaScript")
            .submit
        End With
        Application.Wait (Now + TimeValue("0:00:03"))

        'Start here to extract the href value.
        Set internetdata = IE.Document
        Set div_result = internetdata.getElementById("ctl00_gvMain_ctl03_hlTitle")
        Set header_links = div_result.getElementsByTagName("a")
        For Each h In header_links
            Set link = h.ChildNodes.Item(0)
            Worksheets("Stocks").Cells(Range("L" & Rows.Count).End(xlUp).Row + 1, 12) = link.href
        Next
    Next x
End Sub

Is anyone able to help?
– Nicholas Kan
Sep 20 ’15 at 12:47

What is the problem you have encountered? It’s not clear from your question. Edit your question to elaborate.
– omegastripes
Sep 20 ’15 at 20:18

div_result.getElementsByClassName("a") >> div_result.getElementsByTagName("a")
– Tim Williams
Sep 21 ’15 at 0:26

@TimWilliams Hi Tim, thanks for your answer, but sorry, that is just a typo. I did try getElementsByTagName, as well as getElementsByClassName("news"); they did not work. A solution may be to get the attribute "href" after getting the tag name "a", since "href" is one of the attributes of the tag "a". But I don't know the code to get attributes; could you help?
– Nicholas Kan
Sep 21 ’15 at 1:02

Your anchor element with the target href has id="ctl00_gvMain_ctl03_hlTitle", so you can retrieve the URL with IE.document.getElementById("ctl00_gvMain_ctl03_hlTitle").href or simply IE.document.ctl00_gvMain_ctl03_hlTitle.href. Also try to retrieve the data you need via XHR instead of IE.
– omegastripes
Sep 21 ’15 at 2:55

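As a rough sketch of the XHR route from that comment (illustrative only: the HKEX page is an ASP.NET form, so the real request would need to POST the form fields rather than issue the plain GET shown here):

Sub Get_Href_Via_XHR()
    Dim http As Object, doc As Object, anchor As Object
    Set http = CreateObject("MSXML2.XMLHTTP")
    http.Open "GET", "http://www.hkexnews.hk/listedco/listconews/advancedsearch/search_active_main_c.aspx", False
    http.send
    'parse the response without IE using an htmlfile document
    Set doc = CreateObject("htmlfile")
    doc.body.innerHTML = http.responseText
    Set anchor = doc.getElementById("ctl00_gvMain_ctl03_hlTitle")
    'getAttribute reads the attribute value without IE resolving it against about:blank
    If Not anchor Is Nothing Then Debug.Print anchor.getAttribute("href")
End Sub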

3 Answers

For Each h In header_links
Worksheets("Stocks").Cells(Range("L" & Rows.Count).End(xlUp).Row + 1, 12) = h.href
Next

EDIT: The id attribute is supposed to be unique in the document: there should only be a single element with any given id. So

IE.Document.getElementById("ctl00_gvMain_ctl03_hlTitle").href

should work.
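
For completeness, a minimal sketch wiring that one-liner into the question's loop (assuming IE is already on the results page, and writing to column L of the Stocks sheet as in the original code):

Dim resultLink As Object
'the id is unique, so no loop over anchors is needed
Set resultLink = IE.Document.getElementById("ctl00_gvMain_ctl03_hlTitle")
If Not resultLink Is Nothing Then
    'append the href to the next empty cell in column L
    Worksheets("Stocks").Cells(Worksheets("Stocks").Range("L" & Rows.Count).End(xlUp).Row + 1, 12).Value = resultLink.href
End If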

@Tim Williams Hi Tim, your code works for extracting all the href links on the website, but what I want to extract is only those with id="ctl00_gvMain_ctl03_hlTitle". The href attribute is one of the attributes in the tag <a>; I think something like getAttribute would work, but I don't know the coding for that. Could you help again? Thanks for your patience!
– Nicholas Kan
Sep 21 ’15 at 5:40

@Tim Hi Tim, I am currently using your first answer, the h.href method, and it works perfectly. I did not test your updated answer, however. Thanks a lot!
– Nicholas Kan
Sep 22 ’15 at 15:57

WB.Document.GetElementById("ctl00_gvMain_ctl04_hlTitle").GetAttribute("href").ToString

Use a CSS selector to get the element then access its href attribute.

#ctl00_gvMain_ctl03_hlTitle

The above selects the element with id ctl00_gvMain_ctl03_hlTitle; "#" means id.

Debug.Print IE.document.querySelector("#ctl00_gvMain_ctl03_hlTitle").getAttribute("href")

VBA web scrape not picking up elements; pick up frames/tables?

Tried asking this question. Didn't get many answers. Can't install things onto my work computer. https://stackoverflow.com/questions/29805065/vba-webscrape-not-picking-up-elements

I want to scrape a Morningstar page into Excel with the code below. The problem is, it doesn't feed any real elements/data back. I actually just want the Dividend and Cap Gain Distribution table from the link I put into my_Page.

This is usually the easiest way, but neither an entire-page scrape nor Excel -> Data -> From Web works.

I've tried using getElementsByTagName and getElementsByClassName before, but I failed at being able to do it in this case. This might be the way to go... Once again, I just want that Dividend and Cap Gain Distribution table, and I'm not seeing any results via Debug.Print.

Working code below; I just need to parse the data into Excel. Updated attempt:

Sub Macro1()

    Dim IE As New InternetExplorer
    IE.Visible = True
    IE.navigate "http://quotes.morningstar.com/fund/fundquote/f?&t=ANNPX&culture=en_us&platform=RET&viewId1=2046632524&viewId2=3141452350&viewId3=3475652630"
    Do
        DoEvents
    Loop Until IE.readyState = READYSTATE_COMPLETE
    Dim doc As HTMLDocument
    Set doc = IE.document

    'For Each Table In doc.getElementsByClassName("gr_table_b1")
    'For Each td In Table.getElementsByTagName("tr")
    On Error Resume Next
    For Each td In doc.getElementsByClassName("gr_table_row4")
        Debug.Print td.Cells(5).innerText
        'Debug.Print td.Cells(1).innerText
    Next td
    'Next Table

    'IE.Quit
    'Application.EnableEvents = True

End Sub

2 Answers

The content in question is contained within an iframe. You can see this by right-clicking on that section of the website and selecting Inspect element. Looking up the tree, you'll see an iframe tag containing the URL of the data. You should try to find that element, extract that URL (which is generated with JS), and then open that page.
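
A rough VBA sketch of that approach (this assumes the table sits in the page's first iframe; adjust the index or match on src if there are several):

Dim ieFrames As Object, frameUrl As String
Set ieFrames = IE.document.getElementsByTagName("iframe")
If ieFrames.Length > 0 Then
    'read the js-generated URL from the iframe, then load that page directly
    frameUrl = ieFrames(0).src
    IE.navigate frameUrl
    Do: DoEvents: Loop Until IE.readyState = 4
End If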

My current environment has an old version of IE, which does not render the page properly, so I cannot build something to actually do this.
– Degustaf
May 29 ’15 at 15:51

The iframe is this: quotes.morningstar.com/fund/fundquote/…. My debug yields something now (it just says "YTD"). I think if I can get those tags correct I'll be in business now.
– pjhollow
May 29 ’15 at 17:11

@pjhollow Just so you know, I got a different URL on my end. The viewIds seem to be generated by the JS. I'm not sure what will happen if you hard-code those values.
– Degustaf
May 29 ’15 at 17:29

Thanks for taking the time to look into this. Mind pasting what URL you are seeing?
– pjhollow
May 29 ’15 at 17:30

@pjhollow quotes.morningstar.com/fund/fundquote/…
– Degustaf
May 29 ’15 at 17:32

No frame to worry about. You only need the table id.

[Screenshots omitted: webpage view and print-out from the code.]

VBA:

Option Explicit
Public Sub GetDivAndCapTable()
    Dim ie As New InternetExplorer, hTable As HTMLTable
    Const URL = "http://quotes.morningstar.com/fund/fundquote/f?&t=ANNPX&culture=en_us&platform=RET&viewId1=2046632524&viewId2=3141452350&viewId3=3475652630"
    Application.ScreenUpdating = False
    With ie
        .Visible = True

        .navigate URL

        While .Busy Or .READYSTATE < 4: DoEvents: Wend

        Set hTable = .document.getElementById("DividendAndCaptical")
        WriteTable hTable, 1
        Application.ScreenUpdating = True
        .Quit
    End With
End Sub

Public Sub WriteTable(ByRef hTable As HTMLTable, Optional ByVal startRow As Long = 1, Optional ByRef ws As Worksheet)

    If ws Is Nothing Then Set ws = ActiveSheet

    Dim tSection As Object, tRow As Object, tCell As Object, tr As Object, td As Object, R As Long, C As Long, tBody As Object
    With ws 'write to the passed-in worksheet (ActiveSheet here would ignore the ws argument)
        Dim headers As Object, header As Object, columnCounter As Long
        Set headers = hTable.getElementsByTagName("th")
        For Each header In headers
            columnCounter = columnCounter + 1
            .Cells(startRow, columnCounter) = header.innerText
        Next header
        startRow = startRow + 1
        R = startRow 'body rows start below the header row, so it is not overwritten
        Set tBody = hTable.getElementsByTagName("tbody")
        For Each tSection In tBody 'HTMLTableSection
            Set tRow = tSection.getElementsByTagName("tr") 'HTMLTableRow collection
            For Each tr In tRow
                Set tCell = tr.getElementsByTagName("td")
                C = 1
                For Each td In tCell 'DispHTMLElementCollection
                    .Cells(R, C).Value = td.innerText 'HTMLTableCell
                    C = C + 1
                Next td
                R = R + 1
            Next tr
        Next tSection
    End With
End Sub

BS4/Python3 can't open other href while scraping on Google

My job is with a startup; they call businesses, but they buy the contacts. So I had the idea to scrape them from Google, e.g. hotels.

I can already get the link that opens Google Maps with lots of companies, but I can't get the information inside that link because the program crashes.

import json
from bs4 import BeautifulSoup as bs
from collections import namedtuple
from pprint import pprint
from requests import get
import requests

def remove_escape(s):
    return ' '.join(s.split())

def get_jobs(url):
    vagas = get(url, headers=headers)
    vagas_page = bs(vagas.text, 'html.parser')
    boxes = vagas_page.find_all('div', {'class': 'idQ6DBVUh1_8- ptqfrjbX76M'})
    for box in boxes:
        titulo = box.find('span', {'class': 'ellip'}).text
        empresa = box.find('span', {'class': 'rllt__details'}).text
        yield vaga(
            remove_escape(titulo),
            remove_escape(empresa)
        )

headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36'}

vaga = namedtuple('Vaga', 'Titulo Empresa')
base_url = 'https://www.google.com.br/'
base_url2 = 'https://www.google.com.br'
job = 'empresas+de+telemarketing'
jobs = '{}search?q={}'.format(base_url, job)
vagas = get(jobs, headers=headers)
vagas_page = bs(vagas.text, 'html.parser')
linke = vagas_page.select('.H93uF a')
esse = (linke[0]['href'])
urls = '{}{}'.format(base_url2, esse)

for url in urls:
    print(list(get_jobs(url)))

I don't know if I was clear enough, but you can look at the base link and at the target of the urls string.

Also, please don't mind the variable names; I just need help making it run.

EDIT 1: link below with the bug; sorry, I forgot to do that earlier.
https://imgur.com/a/VrAMgVL

Can you post the error message?
– Winston Yang
Jun 29 at 17:50

See this video youtube.com/watch?v=kktO7IOjpgs and this tutorial and github.com/dunossauro/live-de-python/tree/master/codigo/Live21 refactor your code.
– Regis da Silva
Jun 30 at 3:52

NoneType error during multi-page scrape

I’m working on a web scraper and am close to getting what I need, but I can’t figure out why I’m getting a NoneType error all of a sudden after it finishes scraping the fourth page (of 204). Here’s my code:

# (imports implied by the snippet below)
import gc
import os
import re
import time

from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.support.ui import Select

script_path = os.path.dirname(os.path.realpath(__file__))

driver = webdriver.PhantomJS(executable_path="/usr/local/bin/bin/phantomjs", service_args=['--ignore-ssl-errors=true', '--ssl-protocol=any'])

case_list = []

#this function launches the headless browser and gets us to the first page of results, which we'll scrape using main
def search():

    driver.get('https://www.courts.mo.gov/casenet/cases/nameSearch.do')

    if 'Service Unavailable' in driver.page_source:
        log('Casenet website seems to be down. Receiving "service unavailable"')  # log() is a helper not shown here
        driver.quit()
        gc.collect()
        return False

    time.sleep(2)

    court = Select(driver.find_element_by_id('courtId'))
    court.select_by_visible_text('All Participating Courts')

    case_enter = driver.find_element_by_id('inputVO.lastName')
    case_enter.send_keys('Wakefield & Associates')

    year_enter = driver.find_element_by_id('inputVO.yearFiled')
    year_enter.send_keys('2018')

    driver.find_element_by_id('findButton').click()

    time.sleep(3)

#scrapes table and stores what we need in a list of lists
def main():

    parties = []
    dates = []
    case_nums = []

    html = driver.page_source
    soup = BeautifulSoup(html, 'html.parser')
    table = soup.findAll('table', {'class': 'outerTable'})

    for row in table:

        col = row.find_all('td', attrs={'class': 'td1'})
        col2 = row.find_all('td', attrs={'class': 'td2'})
        all_links = soup.findAll('a')

        for cols in col:
            if 'V' in cols.text:
                cols = cols.string
                cols.encode('utf-8').strip()
                cols = re.sub("xa0''", '', cols).strip()
                parties.append(cols)

        for cols in col2:
            if 'V' in cols.text:
                cols = cols.string
                cols.encode('utf-8').strip()
                cols = re.sub("xa0''", '', cols).strip()
                parties.append(cols)

        for link in all_links:
            raw_html = str(link)
            if 'goToThisCase' in raw_html:
                start = raw_html.find("('") + 2
                end = raw_html.find("',")
                case = raw_html[start:end].strip()
                case_nums.append(case)

        for i in col:
            if '/2018' in i.text:
                i = i.string
                i.encode('utf-8').strip()
                i = re.sub("xa0", '', i).strip()
                dates.append(i)

        for j in col2:
            if '/2018' in j.text:
                j = j.string
                j.encode('utf-8').strip()
                j = re.sub("xa0", '', j).strip()
                dates.append(j)

    case_list.append(parties)
    case_list.append(case_nums)
    case_list.append(dates)

    return case_list

def page_looper():

    main()

    count = '1'
    print "page %s fully scraped" % count
    count = str(int(count) + 1)
    print len(case_list), " cases so far"
    print case_list

    for count in range(2, 9):

        link = driver.find_element_by_link_text(str(count))
        link.click()
        time.sleep(2)

        main()

        print "page %s fully scraped" % count
        count = str(int(count) + 1)
        print len(case_list), " cases so far"
        print case_list

    next_page_link = driver.find_element_by_partial_link_text('Next')
    print "Next 10 pages found"
    next_page_link.click()
    time.sleep(2)

try:
    page_looper()
except Exception:
    print "no more cases"

#pprint.pprint(case_list)
#data = zip(case_list[0],case_list[1],case_list[2])
#pprint.pprint(data)

# with open(script_path + "/cases.csv", "w") as f:
#     writer = csv.writer(f)
#     for d in data:
#         writer.writerow(d)

search()

page_looper()

After it finishes scraping the fourth page, it throws:

Traceback (most recent call last):
  File "wakefield.py", line 175, in <module>
    page_looper()
  File "wakefield.py", line 140, in page_looper
    main()
  File "wakefield.py", line 84, in main
    cols.encode('utf-8').strip()
AttributeError: 'NoneType' object has no attribute 'encode'

Any idea what gives?

I’m also unclear how to make my lists of lists work to export to a csv in the end, where each case is on a row, and the columns are parties, case_num, dates. Thanks in advance.

1 Answer

AttributeError: 'NoneType' object has no attribute 'encode'

means you are trying to invoke the encode method on a None object. To prevent this, you have to check that the object is not None.

Replace:

for cols in col:
    if 'V' in cols.text:
        cols = cols.string
        cols.encode('utf-8').strip()
        cols = re.sub("xa0''", '', cols).strip()
        parties.append(cols)

with:

for cols in col:
    if 'V' in cols.text:
        if cols.string: # check if 'cols.string' is not 'None'
            cols = cols.string
            cols.encode('utf-8').strip()
            cols = re.sub("xa0''", '', cols).strip()
            parties.append(cols)

I did, and then it threw: AttributeError: ‘NavigableString’ object has no attribute ‘text’ — plus I thought I solved the WebElement issue by using .text in the if statement?
– jayohday
Jun 29 at 19:28

Have edited my answer, please check
– Andrei Suvorkov
Jun 29 at 19:35

Made the change and I’m still getting the same error. I’m beginning to think it’s an issue with the website.
– jayohday
Jun 29 at 19:55

I think I have found the problem, please check edited answer.
– Andrei Suvorkov
Jun 30 at 0:15

Excel VBA Macro: Scraping data from site table that spans multiple pages

Thanks in advance for the help. I'm running Windows 8.1, I have the latest IE / Chrome browsers, and the latest Excel. I'm trying to write an Excel macro that pulls data from StackOverflow (https://stackoverflow.com/tags). Specifically, I'm trying to pull the date (that the macro is run), the tag names, the # of tags, and the brief description of what each tag is. I have it working for the first page of the table, but not for the rest (there are 1132 pages at the moment). Right now, it overwrites the data every time I run the macro, and I'm not sure how to make it look for the next empty cell before running. Lastly, I'm trying to make it run automatically once per week.

I'd much appreciate any help with the problems above. Code (so far) is below. Thanks!

Enum READYSTATE
    READYSTATE_UNINITIALIZED = 0
    READYSTATE_LOADING = 1
    READYSTATE_LOADED = 2
    READYSTATE_INTERACTIVE = 3
    READYSTATE_COMPLETE = 4
End Enum

Sub ImportStackOverflowData()
    'to refer to the running copy of Internet Explorer
    Dim ie As InternetExplorer
    'to refer to the HTML document returned
    Dim html As HTMLDocument
    'open Internet Explorer in memory, and go to website
    Set ie = New InternetExplorer
    ie.Visible = False
    ie.navigate "http://stackoverflow.com/tags"

    'Wait until IE is done loading page
    Do While ie.READYSTATE <> READYSTATE_COMPLETE
        Application.StatusBar = "Trying to go to StackOverflow ..."
        DoEvents
    Loop

    'show text of HTML document returned
    Set html = ie.document

    'close down IE and reset status bar
    Set ie = Nothing
    Application.StatusBar = ""

    'clear old data out and put titles in
    'Cells.Clear
    'put heading across the top of row 3
    Range("A3").Value = "Date Pulled"
    Range("B3").Value = "Keyword"
    Range("C3").Value = "# Of Tags"
    'Range("C3").Value = "Asked This Week"
    Range("D3").Value = "Description"

    Dim TagList As IHTMLElement
    Dim Tags As IHTMLElementCollection
    Dim Tag As IHTMLElement
    Dim RowNumber As Long
    Dim TagFields As IHTMLElementCollection
    Dim TagField As IHTMLElement
    Dim Keyword As String
    Dim NumberOfTags As String
    'Dim AskedThisWeek As String
    Dim TagDescription As String
    'Dim QuestionFieldLinks As IHTMLElementCollection
    Dim TodaysDate As Date

    Set TagList = html.getElementById("tags-browser")
    Set Tags = html.getElementsByClassName("tag-cell")
    RowNumber = 4

    For Each Tag In Tags
        'if this is the tag containing the details, process it
        If Tag.className = "tag-cell" Then
            'get a list of all of the parts of this question,
            'and loop over them
            Set TagFields = Tag.all

            For Each TagField In TagFields
                'if this is the keyword, store it
                If TagField.className = "post-tag" Then
                    'store the text value
                    Keyword = TagField.innerText
                    Cells(RowNumber, 2).Value = TagField.innerText
                End If

                If TagField.className = "item-multiplier-count" Then
                    'store the integer for number of tags
                    NumberOfTags = TagField.innerText
                    'NumberOfTags = Replace(NumberOfTags, "x", "")
                    Cells(RowNumber, 3).Value = Trim(NumberOfTags)
                End If

                If TagField.className = "excerpt" Then
                    TagDescription = TagField.innerText 'was "Description", an undeclared variable
                    Cells(RowNumber, 4).Value = TagField.innerText
                End If

                TodaysDate = Format(Now, "MM/dd/yy")
                Cells(RowNumber, 1).Value = TodaysDate

            Next TagField

            'go on to next row of worksheet
            RowNumber = RowNumber + 1
        End If
    Next

    Set html = Nothing

    'do some final formatting
    Range("A3").CurrentRegion.WrapText = False
    Range("A3").CurrentRegion.EntireColumn.AutoFit
    Range("A1:C1").EntireColumn.HorizontalAlignment = xlCenter
    Range("A1:D1").Merge
    Range("A1").Value = "StackOverflow Tag Trends"
    Range("A1").Font.Bold = True
    Application.StatusBar = ""
    MsgBox "Done!"
End Sub

Take a look at this and this.
– omegastripes
Dec 27 ’15 at 1:34

3 Answers

There’s no need to scrape Stack Overflow when they make the underlying data available to you through things like the Data Explorer. Using this query in the Data Explorer should get you the results you need:

select t.TagName, t.Count, p.Body
from Tags t inner join Posts p
on t.ExcerptPostId = p.Id
order by t.count desc;

The permalink to that query is here, and the “Download CSV" option which appears after the query runs is probably the easiest way to get the data into Excel. If you want to automate that part of things, the direct link to the CSV download of results is here.
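
If you do automate the download, here is a minimal VBA sketch of pulling that CSV into a sheet (the URL is a placeholder for the real "Download CSV" link, and the naive comma split will break on excerpt bodies that contain commas):

Sub DownloadTagCsv()
    Dim http As Object, csvText As String
    Dim csvLines() As String, fields() As String
    Dim r As Long, c As Long
    Set http = CreateObject("MSXML2.XMLHTTP")
    'placeholder URL: substitute the real Data Explorer "Download CSV" link
    http.Open "GET", "https://data.stackexchange.com/stackoverflow/csv/XXXX", False
    http.send
    csvText = Replace(http.responseText, vbCr, "") 'normalize CRLF line endings
    csvLines = Split(csvText, vbLf)
    For r = 0 To UBound(csvLines)
        If Len(csvLines(r)) > 0 Then
            'naive comma split: quoted fields containing commas would need a real CSV parser
            fields = Split(csvLines(r), ",")
            For c = 0 To UBound(fields)
                ActiveSheet.Cells(r + 1, c + 1).Value = fields(c)
            Next c
        End If
    Next r
End Sub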

Thanks, that definitely works and is much appreciated. That said, I was really using stack-overflow as an example as this is a common issue I encounter with other sites I need to scrape data from. Any ideas on how to do the same thing via the Macro mentioned above?
– user3511310
Nov 4 ’14 at 2:35

I'm not making use of the DOM, but I find it very easy to get around just searching between known tags. If ever the expressions you are looking for are too common, just tweak the code a bit so that it looks for a string after a string.

An example:

Public Sub ZipLookUp()
    Dim URL As String, xmlHTTP As Object, html As Object, htmlResponse As String
    Dim SStr As String, EStr As String, EndS As Integer, StartS As Integer
    Dim Zip4Digit As String

    URL = "https://tools.usps.com/go/ZipLookupResultsAction!input.action?resultMode=1&companyName=&address1=1642+Harmon+Street&address2=&city=Berkeley&state=CA&urbanCode=&postalCode=&zip=94703"
    Set xmlHTTP = CreateObject("MSXML2.XMLHTTP")
    xmlHTTP.Open "GET", URL, False
    On Error GoTo NoConnect
    xmlHTTP.send
    On Error GoTo 0
    Set html = CreateObject("htmlfile")
    htmlResponse = xmlHTTP.ResponseText
    If Len(htmlResponse) = 0 Then 'was "= Null", which is never True for a String
        MsgBox ("Aborted Run - HTML response was null")
        Application.ScreenUpdating = True
        GoTo End_Prog
    End If

    'Searching for a string within 2 strings
    SStr = "<span class=""address1 range"">" ' first string
    EStr = "</span><br />" ' second string
    StartS = InStr(1, htmlResponse, SStr, vbTextCompare) + Len(SStr)
    EndS = InStr(StartS, htmlResponse, EStr, vbTextCompare)
    Zip4Digit = Left(Mid(htmlResponse, StartS, EndS - StartS), 4)

    MsgBox Zip4Digit

    GoTo End_Prog
NoConnect:
    If Err = -2147467259 Or Err = -2146697211 Then MsgBox "Error - No Connection": GoTo End_Prog 'MsgBox Err & ": " & Error(Err)
End_Prog:
End Sub

You can improve this to parse out exact elements, but it loops over all the pages and grabs all the tag info (everything next to a tag):

Option Explicit

Public Sub ImportStackOverflowData()

    Dim ie As New InternetExplorer, html As HTMLDocument

    Application.ScreenUpdating = False
    With ie
        .Visible = True

        .navigate "https://stackoverflow.com/tags"

        While .Busy Or .READYSTATE < 4: DoEvents: Wend

        Set html = .document
        Dim numPages As Long, i As Long, info As Object, item As Object, counter As Long
        numPages = html.querySelector(".page-numbers.dots ~ a").innerText

        For i = 1 To 2 ' numPages ''<==1 to 2 for testing; use numPages
            DoEvents
            Set info = html.getElementById("tags_list")
            For Each item In info.getElementsByClassName("grid-layout--cell tag-cell")
                counter = counter + 1
                Cells(counter, 1) = item.innerText
            Next item
            html.querySelector(".page-numbers.next").Click
            While .Busy Or .READYSTATE < 4: DoEvents: Wend
            Set html = .document
        Next i
        Application.ScreenUpdating = True
        .Quit '<== Remember to quit application
    End With
End Sub

beautifulsoup multiple keywords from csv file

I have a CSV file with two columns, A and B, and I want to scrape using the whole file with BeautifulSoup.

The URL is composed like this: http://…/search?info=A&who=B. How do I create a loop over the rows?

My code:

from bs4 import BeautifulSoup
import requests
import json
import csv

with open('input.csv') as csvfile:
    reader = csv.reader(csvfile)
    for row in reader:
        url = ".../search?info={}&who={}".format(row[0], row[1])
        response = requests.get(url)
        html = response.content
        soup = BeautifulSoup(html, "html5lib")

        for p in soup.find_all(class_="crd"):
            b = p.find(class_="info")
            if b['data-info'] is not None:
                j = json.loads(b['data-info'])
                data = p.h2.a.string

Why do you need a loop?
– Mad Physicist
Jun 29 at 17:41

And why would you need beautifulsoup if the response is a csv already? Please give a sample output of that request and what you are trying to do.
– Ilhicas
Jun 29 at 17:46

Show me your csv, a couple of lines, like 5 maybe.
– Kostadin Slavov
Jun 29 at 18:04

It’s better to use .DictReader() in order to be able to call the desired items using the header like row['header'] instead of using that hardcoded way.
– SIM
Jun 29 at 21:41

1 Answer

import csv

with open('input.csv') as csvfile:
    reader = csv.reader(csvfile)
    for row in reader:
        url = ".../search?info={}&who={}".format(row[0], row[1])
        # rest of your logic

I have updated my code and I get: File "jo.py", line 8, in <module>, for row in reader: ValueError: I/O operation on closed file
– jarodfrance
Jun 29 at 20:30

It was not indented; I have edited it. Can you check now? The for statement should be inside the with block.
– Linda
Jun 29 at 20:35

AttributeError: 'NoneType' object has no attribute 'get_text'

I tried this program to scrape data from Amazon, but it gives me the error above. Instead of get_text I also tried extract() and strip(); they all give an AttributeError. What should I do?

import urllib.request
from bs4 import BeautifulSoup
import pymysql.cursors

a = input('enter the item to be searched :')
a = a.replace(" ", "")

html = urllib.request.urlopen("https://www.amazon.in/s/ref=nb_sb_noss?url=search-alias%3Daps&field-keywords=" + a)

bsObj = BeautifulSoup(html, 'lxml')
recordList = bsObj.findAll('a', class_='a-link-normal a-text-normal')

connection = pymysql.connect(host='localhost',
                             user='root',
                             password='',
                             db='shopping',
                             charset='utf8mb4',
                             cursorclass=pymysql.cursors.DictCursor)

try:
    with connection.cursor() as cursor:
        for record in recordList:
            name = record.find("h2", {"class": "a-size-small a-color-base s-inline s-access-title a-text-normal"}).get_text().strip()
            sale_price = record.find("span", {"class": "currencyINR"}).get_text().strip()
            category = record.find("span", {"class": "a-color-base a-text-bold"}).get_text().strip()
            sql = "INSERT INTO `amazon` (`name`, `sale_price`, `category`) VALUES (%s, %s, %s)"
            cursor.execute(sql, (name, sale_price, category))
    connection.commit()
finally:
    connection.close()

This just means that the result of your call to record.find is None.
– MoxieBall
Jun 28 at 16:51

1 Answer

Like MoxieBall said in the comment above, your call to record.find is returning a None value. Try checking the value before you call the subsequent .get_text method.

It might look something like this:

raw_sale_price = record.find("span", {"class": "currencyINR"})
if raw_sale_price:
    sale_price = raw_sale_price.get_text().strip()
