Export XML and CSV Data Files in Website Scraper
A1 Website Scraper - Export website data to XML and CSV files
Export Website Data into CSV and Excel XML Files
You can enable the File | Export menu item by clicking, selecting or focusing the control which contains the data you wish to export.
- Most lists, text boxes, tree views, grid views and similar controls can export the data they contain as-is to text or CSV files.
- The controls that list all URLs found during website crawls can also export to the Excel XML spreadsheet format.
How to export website data to CSV, text and similar file formats:
- Select the control, e.g. by clicking the mouse cursor on it.
- Adjust the control, e.g. by enabling/disabling visibility of data columns.
- The File | Export menu item is now enabled if applicable. (There is also a corresponding toolbar button.)
- Choose the file format to save as: comma-separated values (.csv), tab-separated values (.tsv), .html and more.
In the screenshot below you can see:
- We have selected the tree view control at the left side.
- We have configured visible data columns and filtered visible URLs to control what gets exported.

Note: Screenshot is from A1 Website Analyzer which has more data columns and filtering options than A1 Website Scraper.
Format Options for CSV Data Export
The options for CSV export are found in the menu File - Export options:
- Data included:
  - Export CSV Data with Headers
  - Export CSV Data with URL
  - Wrap cells with line breaks in "" (instead of converting line breaks to spaces)
- Character format and encoding:
  - UTF-8 with optional BOM. (ASCII is a subset of UTF-8. Ideal for English documents.)
  - UTF-16 LE (UCS-2) with optional BOM. (Used internally in current Windows systems.)
  - Local ANSI codepage. (May not always be portable to other platforms and languages.)
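
When exported files are processed by a script rather than opened in a spreadsheet, the encoding chosen above has to be recognised on the reading side. The following is a minimal Python sketch, not part of A1 Website Scraper: the function names are made up for illustration, and it assumes that a file without a BOM is either UTF-8 or the Windows-1252 ANSI code page (UTF-16 LE saved without a BOM is not handled).

```python
import codecs
import csv
import io


def decode_export(raw: bytes, ansi_codepage: str = "cp1252") -> str:
    """Decode exported bytes, using the BOM when one is present."""
    if raw.startswith(codecs.BOM_UTF8):
        return raw.decode("utf-8-sig")   # strips the UTF-8 BOM
    if raw.startswith(codecs.BOM_UTF16_LE):
        return raw.decode("utf-16")      # the codec consumes the BOM
    try:
        return raw.decode("utf-8")       # assumption: try UTF-8 without BOM first
    except UnicodeDecodeError:
        return raw.decode(ansi_codepage) # assumption: local ANSI code page


def read_rows(raw: bytes, delimiter: str = ","):
    """Parse exported CSV (or TSV with delimiter='\\t') into a list of rows."""
    text = decode_export(raw)
    # newline="" keeps line breaks inside quoted cells intact for the csv module
    return list(csv.reader(io.StringIO(text, newline=""), delimiter=delimiter))
```

Cells that were exported wrapped in "" keep their embedded line breaks when read this way, because the csv module treats a quoted line break as part of the cell.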

(Selecting ANSI for CSV export in website scraper)
Unicode CSV Files and OpenOffice or Microsoft Office Import
Some versions of Open Office, Libre Office and Microsoft Office can have problems importing CSV data since they do not automatically detect the character encoding format. If you experience problems (not likely for e.g. English website data exports), you can use the import dialog in the Office tools:

(Selecting UTF-8 for CSV Import in Open Office / Libre Office dialog)

(Selecting ANSI for CSV Import in Microsoft Office dialog)
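
As an alternative to choosing the encoding in the import dialog, an exported file can be re-encoded before it is opened. The sketch below is a hypothetical helper, not a feature of A1 Website Scraper. It assumes you know which encoding the file was exported in, and it writes UTF-8 with a BOM by default, which many spreadsheet tools use as a hint for the encoding.

```python
from pathlib import Path


def reencode_csv(src: Path, dst: Path, src_encoding: str,
                 dst_encoding: str = "utf-8-sig") -> None:
    """Rewrite a CSV file in another encoding without touching its line breaks."""
    # Work on bytes so that "\r\n" inside the file is preserved as-is.
    text = src.read_bytes().decode(src_encoding)
    # "utf-8-sig" prepends a UTF-8 BOM when encoding.
    dst.write_bytes(text.encode(dst_encoding))
```

For example, `reencode_csv(Path("export.csv"), Path("export-utf8.csv"), "cp1252")` would convert a file saved with the Windows-1252 code page; both file names are placeholders.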
Project Website Data is Saved as XML
Structured data extracted from a resource is often called metadata, or "data about data". When you save projects in A1 Website Scraper, a large amount of such data is saved into XML files.
Because the data is XML, you can easily analyse it and perform data mining (mine the data for more information). XML parsers are available for almost all languages, e.g. Java, PHP, C#, Visual Basic, Delphi etc.
If you have saved your project to c:\projects\myproject.ini, you can find the XML files at c:\projects\myproject\.
If you prefer easy-to-read field names and indented XML, uncheck Options - Favour save/load XML speed. However, if you have huge websites and use software for further data mining, you may want to leave this option checked since it decreases the XML document sizes by up to 30%.
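
The relationship between the project file and its XML folder can be expressed in a few lines of Python. This is only a sketch of the naming rule described above (the folder has the same name as the project file, without the extension); the function name is hypothetical.

```python
from pathlib import PureWindowsPath


def project_data_dir(project_file: str) -> PureWindowsPath:
    """Return the folder that holds the XML files for a saved project."""
    # c:\projects\myproject.ini -> c:\projects\myproject
    return PureWindowsPath(project_file).with_suffix("")
```

`PureWindowsPath` is used so the example behaves the same on any operating system; on a Windows machine, `pathlib.Path` could be used instead to list the XML files in the folder.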

- Website project meta data is stored in XML documents perfect for data mining. Some examples:
  - Totals data:
    - Total amount of links within a site
    - Total amount of pages that link within a site
    - Minimum amount of links any page has to it
    - Maximum amount of links any page has to it
    - Minimum amount of pages any page has linking to it
    - Maximum amount of pages any page has linking to it
  - Items collection data:
    - Amount of items found. This can be pages, images, etc.
  - Item data:
    - Page title
    - Response headers
    - Response code
    - Response text
    - Response time
    - Download time
    - Full path
    - Relative path (within site)
    - File extension
    - File kind
    - File size
    - Charset
    - Last modified (HTTP header)
    - Links found list
    - Linked to from list (includes a list and count of all pages and links)
    - Used as source from list (e.g. wherefrom an image or javascript is used)
    - Redirected to from list (view all and full redirection chains)
    - Summary data about what was found within a directory; file types, how many of these not found, etc.
    - Calculated page importance. Raw value and 0-10 scaled. For details, see the "website data" section.
XML File Structure and Documentation
Field name | Speed config | Description |
<data> | ||
----<meta> | ||
--------<version> | ||
--------<fast> | ||
--------<dataexrefs> | ||
----</meta> | ||
----<structure> | ||
--------<rootpath> | ||
--------<checkedlevel> | ||
----</structure> | ||
----<totals> | ||
--------<linked> | ||
------------<allpagesto> | ||
------------<minpagesto> | ||
------------<maxpagesto> | ||
------------<allrefersto> | ||
------------<minrefersto> | ||
------------<maxrefersto> | ||
--------</linked> | ||
----</totals> | ||
----<items> | ||
--------<item> * | ||
------------<imb> | | information meta data |
----------------<fs_ar> | | analysis required |
----------------<fs_as> | | analysis started |
----------------<fs_ac> | | analysis completed |
------------</imb> | ||
------------<title> | ||
------------<allheaderstext> | <allht> | |
------------<responsecode> | <recode> | |
------------<responsetimeouter> | <reto> | |
------------<downloadtimeouter> | <doto> | |
------------<pathroot> | ||
------------<pathrela> | ||
------------<realext> | ||
------------<kindext> | ||
------------<valerrs> | ||
------------<charset> | ||
------------<sizeexpected> | <sizeex> | |
------------<sizeconfirmed> | <sizeco> | |
------------<lastmodified> | <lastmo> | |
------------<revisitaftermins> | <revmins> | |
------------<linkstotalall> | <lksta> | |
------------<linkstotalto> | <lkstt> | |
------------<linkstolist> | <lkstl> | |
----------------<linkstoitem> * | <lksti> | |
------------</linkstolist> | </lkstl> | |
------------<linkedtotalall> | <lnkta> | |
------------<linkedtotalfrom> | <lnktf> | |
------------<linkedfromlist> | <lnkfl> | |
----------------<linkedfromitem> * | <lnkfi> | |
------------</linkedfromlist> | </lnkfl> | |
------------<sourcedtotalall> | <srcta> | |
------------<sourcedtotalfrom> | <srctf> | |
------------<sourcedfromlist> | <srcfl> | |
----------------<sourcedfromitem> * | <srcfi> | |
------------</sourcedfromlist> | </srcfl> | |
------------<redirectedtotalall> | <redta> | |
------------<redirectedtotalfrom> | <redtf> | |
------------<redirectedfromlist> | <redfl> | |
----------------<redirectedfromitem> * | <redfi> | |
--------------------<redirectedfromitemfrom> | <redfif> | |
--------------------<redirectedfromitemtype> | <redfit> | |
--------------------<redirectedfromitemchain> | <redfic> | |
------------------------<redirectedfromitemring> * | <redfir> | |
--------------------</redirectedfromitemchain> | </redfic> | |
----------------</redirectedfromitem> | </redfi> | |
------------</redirectedfromlist> | </redfl> | |
------------<importancescore> | ||
------------<importancescorescaled> | ||
------------<changefreqscorescaled> | ||
------------<summaryfoundall> | ||
------------<summaryfoundlist> | ||
----------------<summaryfounditem> * | ||
--------------------<summaryfounditemisdir> | ||
--------------------<summaryfounditemextreal> | ||
--------------------<summaryfounditemextkind> | ||
--------------------<summaryfounditemresponsecode> | ||
--------------------<summaryfounditemcount> | ||
----------------</summaryfounditem> | ||
------------</summaryfoundlist> | ||
--------</item> | ||
----</items> | ||
</data> |
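
The structure above can be read with any XML parser. The following Python sketch uses the standard library and looks up a field by its long name first, then by the short name listed in the "Speed config" column. It rests on assumptions: that values are stored as element text, and that the sample document used with it resembles a real project file. The function names and the small name mapping are illustrative only.

```python
import xml.etree.ElementTree as ET

# Long field name -> short name used when "Favour save/load XML speed" is
# checked. Taken from the table above; only a few fields are listed here.
SHORT_NAMES = {
    "responsecode": "recode",
    "linkstotalall": "lksta",
    "linkedtotalfrom": "lnktf",
}


def field(item, name):
    """Return the text of a child element, trying the long name, then the short."""
    node = item.find(name)
    if node is None and name in SHORT_NAMES:
        node = item.find(SHORT_NAMES[name])
    return node.text if node is not None else None


def summarize(xml_text):
    """Collect path, title and response code for every <item> in the document."""
    root = ET.fromstring(xml_text)
    return [
        {
            "path": field(item, "pathrela"),
            "title": field(item, "title"),
            "code": field(item, "responsecode"),
        }
        for item in root.iter("item")
    ]
```

Looking up both names lets the same script work regardless of how the speed option was set when the project was saved.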
