lianjia.zip


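Goal: scrape new-home listing data from the official Lianjia website (3-5 pages is enough; crawling more may get your IP banned).
URL: https://bj.fang.lianjia.com/loupan/
Requirements: save each development's name, price, floor area in square meters, and similar fields (the field set may be extended) to a single JSON file.
Deliverable: an archive of the whole project (rar or zip format). The archive must be named "ID-assignment number"!
My answer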
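
To illustrate the task described above (this is not the code contained in the archive), here is a minimal standard-library sketch of two of its steps: building a small number of listing-page URLs and writing the collected records to one JSON file. The `pg<N>/` pagination suffix and the helper names `page_urls` and `save_records` are assumptions made for illustration; fetching and HTML parsing are left out because the page structure is not given here.

```python
import json

BASE_URL = "https://bj.fang.lianjia.com/loupan/"


def page_urls(pages=3):
    """Return listing-page URLs for the first `pages` pages.

    ASSUMPTION: pages after the first are reached via a `pg<N>/` suffix;
    verify this against the live site before relying on it.
    """
    return [BASE_URL if n == 1 else "{}pg{}/".format(BASE_URL, n)
            for n in range(1, pages + 1)]


def save_records(records, path):
    """Write all records to one JSON file, keeping Chinese text readable."""
    with open(path, "w", encoding="utf-8") as fh:
        # ensure_ascii=False stores Chinese names as-is instead of \uXXXX escapes.
        json.dump(records, fh, ensure_ascii=False, indent=2)
```

A scraper would fill a list of dicts (for example with `name`, `price`, and `area` keys) from the pages returned by `page_urls(3)` and then call `save_records(rows, "MyData.json")`. Keeping the page count small and pausing between requests is in line with the warning about IP bans.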
Code snippet and file information
import logging
import re
from collections import namedtuple
from datetime import time

import six
from six.moves.urllib.parse import (ParseResult, quote, urlparse,
                                    urlunparse)

logger = logging.getLogger(__name__)

_Rule = namedtuple('Rule', ['field', 'value'])
RequestRate = namedtuple(
    'RequestRate', ['requests', 'seconds', 'start_time', 'end_time'])

_DISALLOW_DIRECTIVE = {'disallow', 'dissallow', 'dissalow', 'disalow', 'diasllow', 'disallaw'}
_ALLOW_DIRECTIVE = {'allow'}
_USER_AGENT_DIRECTIVE = {'user-agent', 'useragent', 'user agent'}
_SITEMAP_DIRECTIVE = {'sitemap', 'sitemaps', 'site-map'}
_CRAWL_DELAY_DIRECTIVE = {'crawl-delay', 'crawl delay'}
_REQUEST_RATE_DIRECTIVE = {'request-rate', 'request rate'}
_HOST_DIRECTIVE = {'host'}

_WILDCARDS = {'*', '$'}

_HEX_DIGITS = set('0123456789ABCDEFabcdef')

__all__ = ['RequestRate', 'Protego']


def _is_valid_directive_field(field):
    return any([field in _DISALLOW_DIRECTIVE,
                field in _ALLOW_DIRECTIVE,
                field in _USER_AGENT_DIRECTIVE,
                field in _SITEMAP_DIRECTIVE,
                field in _CRAWL_DELAY_DIRECTIVE,
                field in _REQUEST_RATE_DIRECTIVE,
                field in _HOST_DIRECTIVE])


def _enforce_path(pattern):
    if pattern.startswith('/'):
        return pattern

    return '/' + pattern


class _URLPattern(object):
    """Internal class which represents a URL pattern."""

    def __init__(self, pattern):
        self._pattern = pattern
        self.priority = len(pattern)
        self._contains_asterisk = '*' in self._pattern
        self._contains_dollar = self._pattern.endswith('$')

        if self._contains_asterisk:
            self._pattern_before_asterisk = self._pattern[:self._pattern.find('*')]
        elif self._contains_dollar:
            self._pattern_before_dollar = self._pattern[:-1]

        self._pattern_compiled = False

    def match(self, url):
        """Return True if pattern matches the given URL, otherwise return False."""
        # check if pattern is already compiled
        if self._pattern_compiled:
            return self._pattern.match(url)

        if not self._contains_asterisk:
            if not self._contains_dollar:
                # answer directly for patterns without wildcards
                return url.startswith(self._pattern)

            # pattern only contains $ wildcard.
            return url == self._pattern_before_dollar

        if not url.startswith(self._pattern_before_asterisk):
            return False

        self._pattern = self._prepare_pattern_for_regex(self._pattern)
        self._pattern = re.compile(self._pattern)
        self._pattern_compiled = True
        return self._pattern.match(url)

    def _prepare_pattern_for_regex(self, pattern):
        """Return equivalent regex pattern for the given URL pattern."""
        pattern = re.sub(r'\*+', '*', pattern)
        s = re.split(r'(\*|\$$)', pattern)
        for index, substr in
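
The preview above is cut off mid-function. As a standalone illustration of the idea behind `match` (treat `*` as "any run of characters" and a trailing `$` as "end of URL"), here is a minimal sketch using only `re`. It is not the truncated remainder of the listing; `pattern_to_regex` and `matches` are hypothetical names, not part of the library's API.

```python
import re


def pattern_to_regex(pattern):
    """Translate a robots.txt-style path pattern into a compiled regex."""
    # Collapse runs of '*' so '/a**b' behaves like '/a*b'.
    pattern = re.sub(r"\*+", "*", pattern)
    # Split on '*' and on a '$' that ends the pattern, keeping the wildcards.
    parts = re.split(r"(\*|\$$)", pattern)
    regex = []
    for part in parts:
        if part == "*":
            regex.append(".*?")            # any run of characters
        elif part == "$":
            regex.append("$")              # anchor at the end of the URL
        else:
            regex.append(re.escape(part))  # literal text
    return re.compile("".join(regex))


def matches(pattern, url):
    """Return True if the URL path matches the pattern from its start."""
    return pattern_to_regex(pattern).match(url) is not None
```

For example, `matches("/*.json$", "/data/MyData.json")` is true, while the same pattern rejects `/data/MyData.json?x=1` because the trailing `$` requires the URL to end right after `.json`. A `$` that is not the last character of the pattern is treated as a literal.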

Attribute        Size  Date       Time   Name
----------- ---------  ---------- -----  ----
Directory           0  2020-05-19 11:01  lianjia
Directory           0  2020-05-19 11:02  lianjia\.idea
Directory           0  2020-05-09 16:17  lianjia\.idea\inspectionProfiles
File              174  2020-05-09 11:03  lianjia\.idea\inspectionProfiles\profiles_settings.xml
File              361  2020-05-09 11:03  lianjia\.idea\lianjia.iml
File              198  2020-05-09 11:03  lianjia\.idea\misc.xml
File              273  2020-05-09 11:03  lianjia\.idea\modules.xml
File             6342  2020-05-19 11:02  lianjia\.idea\workspace.xml
File            17790  2020-05-19 10:59  lianjia\MyData.json
Directory           0  2020-05-09 16:19  lianjia\venv
Directory           0  2020-05-09 11:02  lianjia\venv\Include
Directory           0  2020-05-09 16:17  lianjia\venv\Lib
Directory           0  2020-05-09 16:19  lianjia\venv\Lib\site-packages
Directory           0  2020-05-09 16:17  lianjia\venv\Lib\site-packages\attr
Directory           0  2020-05-09 16:17  lianjia\venv\Lib\site-packages\attrs-19.3.0.dist-info
File                4  2020-05-09 11:19  lianjia\venv\Lib\site-packages\attrs-19.3.0.dist-info\INSTALLER
File             1082  2020-05-09 11:19  lianjia\venv\Lib\site-packages\attrs-19.3.0.dist-info\LICENSE
File             9022  2020-05-09 11:19  lianjia\venv\Lib\site-packages\attrs-19.3.0.dist-info\METADATA
File             2184  2020-05-09 11:19  lianjia\venv\Lib\site-packages\attrs-19.3.0.dist-info\RECORD
File                5  2020-05-09 11:19  lianjia\venv\Lib\site-packages\attrs-19.3.0.dist-info\top_level.txt
File              110  2020-05-09 11:19  lianjia\venv\Lib\site-packages\attrs-19.3.0.dist-info\WHEEL
File             2141  2020-05-09 11:19  lianjia\venv\Lib\site-packages\attr\converters.py
File              351  2020-05-09 11:19  lianjia\venv\Lib\site-packages\attr\converters.pyi
File             1635  2020-05-09 11:19  lianjia\venv\Lib\site-packages\attr\exceptions.py
File              458  2020-05-09 11:19  lianjia\venv\Lib\site-packages\attr\exceptions.pyi
File             1098  2020-05-09 11:19  lianjia\venv\Lib\site-packages\attr\filters.py
File              214  2020-05-09 11:19  lianjia\venv\Lib\site-packages\attr\filters.pyi
File                0  2020-05-09 11:19  lianjia\venv\Lib\site-packages\attr\py.typed
File            11460  2020-05-09 11:19  lianjia\venv\Lib\site-packages\attr\validators.py
File             1868  2020-05-09 11:19  lianjia\venv\Lib\site-packages\attr\validators.pyi
File             7326  2020-05-09 11:19  lianjia\venv\Lib\site-packages\attr\_compat.py
............ (information for 4028 more files omitted)
