Scraping cnblogs News Data

Author: 网络编程  Published: 2019-09-24

1. Introduction

Compared with Java, there are far fewer open-source crawlers for C#. In all my years in this field I had never worked with a crawler, so out of curiosity I'd like to introduce Abot, a C# crawler library, which you can get via NuGet. Abot supports multi-threaded crawling out of the box,

and it uses CsQuery internally to parse the crawled HTML documents. Anyone familiar with jQuery will pick up CsQuery quickly; it is essentially a C# port of jQuery.

Here we take scraping the current day's news from cnblogs as an example to see how Abot is used.

2. The cnblogs News Pages

The cnblogs news home page uses standard numbered pagination; a pagination URL such as http://news.cnblogs.com/n/page/2/ points to the second page of news.

A real news detail page, by contrast, has a URL of the form http://news.cnblogs.com/n/{id}/. Both URL types are easy to match with regular expressions.

Of course, we could use a for loop to crawl each page's news and then filter out the items published today. But I would rather use only http://news.cnblogs.com/ as the seed page and crawl today's news starting from there.

由于和讯音信分页并非使用Ajax,对于爬虫来说那特别和气

[Figure 1]

So we define:

        /// <summary>
        /// Seed URL
        /// </summary>
        public static readonly Uri FeedUrl = new Uri(@"http://news.cnblogs.com/");

        /// <summary>
        /// Matches news detail pages
        /// </summary>
        public static Regex NewsUrlRegex = new Regex(@"^http://news\.cnblogs\.com/n/\d+/$", RegexOptions.Compiled);

        /// <summary>
        /// Matches pagination pages
        /// </summary>
        public static Regex NewsPageRegex = new Regex(@"^http://news\.cnblogs\.com/n/page/\d+/$", RegexOptions.Compiled);
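As a quick sanity check, the two patterns can be exercised against sample URLs. This is a standalone sketch, not part of the crawler; the numeric IDs below are made up for illustration:

```csharp
using System;
using System.Text.RegularExpressions;

class RegexDemo
{
    static readonly Regex NewsUrlRegex =
        new Regex(@"^http://news\.cnblogs\.com/n/\d+/$", RegexOptions.Compiled);
    static readonly Regex NewsPageRegex =
        new Regex(@"^http://news\.cnblogs\.com/n/page/\d+/$", RegexOptions.Compiled);

    static void Main()
    {
        // A detail-page URL matches only the detail regex.
        Console.WriteLine(NewsUrlRegex.IsMatch("http://news.cnblogs.com/n/123456/"));  // True
        // A pagination URL matches only the pagination regex.
        Console.WriteLine(NewsPageRegex.IsMatch("http://news.cnblogs.com/n/page/2/")); // True
        // The home page matches neither; it is handled separately as the seed URL.
        Console.WriteLine(NewsUrlRegex.IsMatch("http://news.cnblogs.com/"));           // False
    }
}
```

Note the `@` verbatim string literals: without them, `\d` is an invalid escape sequence in a C# string and the code will not compile.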

3. Implementation

Abot already encapsulates the crawler internals quite thoroughly; the user only needs to set a few Config parameters and hook up some page-crawl events.

        public static IWebCrawler GetManuallyConfiguredWebCrawler()
        {
            CrawlConfiguration config = new CrawlConfiguration();
            config.CrawlTimeoutSeconds = 0;
            config.DownloadableContentTypes = "text/html, text/plain";
            config.IsExternalPageCrawlingEnabled = false;
            config.IsExternalPageLinksCrawlingEnabled = false;
            config.IsRespectRobotsDotTextEnabled = false;
            config.IsUriRecrawlingEnabled = false;
            config.MaxConcurrentThreads = System.Environment.ProcessorCount;
            config.MaxPagesToCrawl = 1000;
            config.MaxPagesToCrawlPerDomain = 0;
            config.MinCrawlDelayPerDomainMilliSeconds = 1000;

            var crawler = new PoliteWebCrawler(config, null, null, null, null, null, null, null, null);
            crawler.ShouldCrawlPage(ShouldCrawlPage);
            crawler.ShouldDownloadPageContent(ShouldDownloadPageContent);
            crawler.ShouldCrawlPageLinks(ShouldCrawlPageLinks);
            crawler.PageCrawlStartingAsync += crawler_ProcessPageCrawlStarting;
            // Callback fired after each page has been crawled
            crawler.PageCrawlCompletedAsync += crawler_ProcessPageCrawlCompletedAsync;
            crawler.PageCrawlDisallowedAsync += crawler_PageCrawlDisallowed;
            crawler.PageLinksCrawlDisallowedAsync += crawler_PageLinksCrawlDisallowed;
            return crawler;
        }

Invoking it is very simple:

        public static void Main(string[] args)
        {
            var crawler = GetManuallyConfiguredWebCrawler();
            var result = crawler.Crawl(FeedUrl);
            System.Console.WriteLine(result.ErrorException);
        }

The key event is PageCrawlCompletedAsync; in its handler we can extract the page data we need.

        public static void crawler_ProcessPageCrawlCompletedAsync(object sender, PageCrawlCompletedArgs e)
        {
            // Only process news detail pages
            if (NewsUrlRegex.IsMatch(e.CrawledPage.Uri.AbsoluteUri))
            {
                // Grab the news title and publish time
                var csTitle = e.CrawledPage.CsQueryDocument.Select("#news_title");
                var linkDom = csTitle.FirstElement().FirstChild;
                var newsInfo = e.CrawledPage.CsQueryDocument.Select("#news_info");
                var dateString = newsInfo.Select(".time", newsInfo);
                // Keep the item only if it was published today
                if (IsPublishToday(dateString.Text()))
                {
                    var str = e.CrawledPage.Uri.AbsoluteUri + "\t" + HttpUtility.HtmlDecode(linkDom.InnerText) + "\r\n";
                    System.IO.File.AppendAllText("fake.txt", str);
                }
            }
        }

        /// <summary>
        /// "发布于 2016-05-09 11:25" => true (when run on that day)
        /// </summary>
        public static bool IsPublishToday(string str)
        {
            if (string.IsNullOrEmpty(str))
            {
                return false;
            }
            const string prefix = "发布于";
            int index = str.IndexOf(prefix, StringComparison.OrdinalIgnoreCase);
            if (index >= 0)
            {
                str = str.Substring(index + prefix.Length).Trim();
            }
            DateTime date;
            return DateTime.TryParse(str, out date) && date.Date.Equals(DateTime.Today);
        }
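The publish-date check boils down to stripping the "发布于" ("published at") prefix and calling DateTime.TryParse. Here is a minimal standalone sketch of that logic; the class name and sample strings are illustrative only:

```csharp
using System;

class PublishDateDemo
{
    // Returns true when a "发布于 yyyy-MM-dd HH:mm" string refers to today's date.
    public static bool IsPublishToday(string str)
    {
        if (string.IsNullOrEmpty(str)) return false;

        const string prefix = "发布于";
        int index = str.IndexOf(prefix, StringComparison.Ordinal);
        if (index >= 0)
        {
            // Drop everything up to and including the prefix before parsing.
            str = str.Substring(index + prefix.Length).Trim();
        }

        DateTime date;
        return DateTime.TryParse(str, out date) && date.Date == DateTime.Today;
    }

    static void Main()
    {
        string today = DateTime.Today.ToString("yyyy-MM-dd");
        Console.WriteLine(IsPublishToday("发布于 " + today + " 11:25")); // True
        Console.WriteLine(IsPublishToday("发布于 2016-05-09 11:25"));    // False unless run on that day
        Console.WriteLine(IsPublishToday(""));                           // False
    }
}
```

One subtlety worth noting: since IndexOf may find the prefix at any position, the substring must start at `index + prefix.Length`, not at `prefix.Length`.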

To make crawling more efficient, the crawler should skip the many links on the home page that we clearly do not need. We can set crawl rules for this:

        /// <summary>
        /// Crawl a page only if it is the feed page, a pagination page, or a detail page
        /// </summary>
        private static CrawlDecision ShouldCrawlPage(PageToCrawl pageToCrawl, CrawlContext context)
        {
            if (pageToCrawl.IsRoot || pageToCrawl.IsRetry || FeedUrl == pageToCrawl.Uri
                || NewsPageRegex.IsMatch(pageToCrawl.Uri.AbsoluteUri)
                || NewsUrlRegex.IsMatch(pageToCrawl.Uri.AbsoluteUri))
            {
                return new CrawlDecision { Allow = true };
            }
            else
            {
                return new CrawlDecision { Allow = false, Reason = "Not match uri" };
            }
        }

        /// <summary>
        /// Download content only for the feed page, pagination pages, and detail pages
        /// </summary>
        private static CrawlDecision ShouldDownloadPageContent(PageToCrawl pageToCrawl, CrawlContext crawlContext)
        {
            if (pageToCrawl.IsRoot || pageToCrawl.IsRetry || FeedUrl == pageToCrawl.Uri
                || NewsPageRegex.IsMatch(pageToCrawl.Uri.AbsoluteUri)
                || NewsUrlRegex.IsMatch(pageToCrawl.Uri.AbsoluteUri))
            {
                return new CrawlDecision { Allow = true };
            }
            return new CrawlDecision { Allow = false, Reason = "Not match uri" };
        }

        private static CrawlDecision ShouldCrawlPageLinks(CrawledPage crawledPage, CrawlContext crawlContext)
        {
            if (!crawledPage.IsInternal)
                return new CrawlDecision { Allow = false, Reason = "We dont crawl links of external pages" };

            if (crawledPage.IsRoot || crawledPage.IsRetry || crawledPage.Uri == FeedUrl
                || NewsPageRegex.IsMatch(crawledPage.Uri.AbsoluteUri))
            {
                return new CrawlDecision { Allow = true };
            }
            else
            {
                return new CrawlDecision { Allow = false, Reason = "We only crawl links of pagination pages" };
            }
        }

The captured data looks like this:

[Figure 2]

4. Summary

Abot is a very convenient crawler. Before using it in a real production environment, parameter tuning is the first thing to sort out, e.g. MaxPagesToCrawl, the maximum number of pages to crawl; you can also limit the crawler's memory usage, among other settings.

You are welcome to visit my personal site 51zhang.net, which is still under development…

