Skip to content

Converting HTML table to CSV file from shell

An answer to this question on Stack Overflow.

Question

im trying to convert a file with an HTML table to CSV format. An excerpt from this file follows:

    <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
    <html xmlns="http://www.w3.org/1999/xhtml" >
    <head id="Head1"><link rel="shortcut icon" href="favicon.ico" /><title>
Untitled Page
    </title></head>
    <body>
        <form name="form1" method="post" action="mypricelist.aspx" id="form1">
    <input type="hidden" name="__VIEWSTATE" id="__VIEWSTATE" value="/somethingrandom" />
<div>
    <table id="price_list" border="0">
<tr>
	<td>ProdCode</td><td>Description</td><td>Your Price</td>
</tr><tr>
	<td>ab101</td><td>loruem</td><td>1.1</td>
</tr><tr>
	<td>ab102</td><td>ipsum</td><td>0.1</td>
</tr><tr>

i tried using

    xls2csv -x -c\; evprice.xls > evprice.csv

but that gives me an error saying

    evprice.xls is not OLE file or Error

I googled. it said that is was because the file wasn't proper xls but just html.

When i try

    file evprice.xls

its says its html so found a 'solution', using libreoffice.

    libreoffice --headless -convert-to csv ./evprice.xls 

well this does not give an error but the csv output file is all weird, like opening an exe file in notepad.

it contains a lot of strange characters like these

    —¬ž­þ9ü~ÆóXþK¢

anyone know why this is happening, and got a working solution?

Answer

I have built a Python utility which converts all the tables in an HTML file into separate CSV files.

You can find it here.

The crux of the script is this:

from BeautifulSoup import BeautifulSoup
import csv
filename = "MY_HTML_FILE"
fin      = open(filename,'r')
print "Opening file"
fin  = fin.read()
print "Parsing file"
soup = BeautifulSoup(fin,convertEntities=BeautifulSoup.HTML_ENTITIES)
print "Preemptively removing unnecessary tags"
[s.extract() for s in soup('script')]
print "CSVing file"
tablecount = -1
for table in soup.findAll("table"):
  tablecount += 1
  print "Processing Table #%d" % (tablecount)
  with open(sys.argv[1]+str(tablecount)+'.csv', 'wb') as csvfile:
    fout = csv.writer(csvfile, delimiter=',', quotechar='"', quoting=csv.QUOTE_MINIMAL)
    for row in table.findAll('tr'):
      cols = row.findAll(['td','th'])
      if cols:
        cols = [x.text for x in cols]
        fout.writerow(cols)