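Goal: scrape new-home listing data from the official Lianjia website (3-5 pages is enough; crawling more may get your IP banned).
URL: https://bj.fang.lianjia.com/loupan/
Requirements: save the development name, price, floor area, and similar fields (you may add more) to a single JSON file.
Deliverable: a compressed archive of the whole project (rar or zip format). The archive must be named "ID-assignment number".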
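The two mechanical parts of the brief, enumerating a handful of listing pages and writing the collected records to one JSON file, can be sketched as below. This is a minimal illustration, not the submitted project: the helper names `page_urls` and `save_records` are hypothetical, and the `pgN/` pagination scheme is an assumption about the site's URL layout that should be checked against the live pages.

```python
import json

BASE_URL = "https://bj.fang.lianjia.com/loupan/"


def page_urls(pages=3):
    """Return the listing URLs for the first `pages` pages.

    Assumption: page 1 is the base URL and page N > 1 lives at
    <base>pgN/ . Verify this against the real site before relying on it.
    """
    return [BASE_URL if n == 1 else f"{BASE_URL}pg{n}/"
            for n in range(1, pages + 1)]


def save_records(records, path="MyData.json"):
    """Write a list of dicts to a single UTF-8 JSON file."""
    with open(path, "w", encoding="utf-8") as f:
        # ensure_ascii=False keeps Chinese names readable in the file.
        json.dump(records, f, ensure_ascii=False, indent=2)
```

Fetching and parsing are left out on purpose, since the selectors depend on the page markup. Keeping the page count at 3-5 and pausing between requests follows the brief's warning about IP bans.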
My answer
Code snippet and file information
Attr   Size  Date        Time   Name
----  -----  ----------  -----  ----
Dir       0  2020-05-19  11:01  lianjia\
Dir       0  2020-05-19  11:02  lianjia\.idea\
Dir       0  2020-05-09  16:17  lianjia\.idea\inspectionProfiles\
File    174  2020-05-09  11:03  lianjia\.idea\inspectionProfiles\profiles_settings.xml
File    361  2020-05-09  11:03  lianjia\.idea\lianjia.iml
File    198  2020-05-09  11:03  lianjia\.idea\misc.xml
File    273  2020-05-09  11:03  lianjia\.idea\modules.xml
File   6342  2020-05-19  11:02  lianjia\.idea\workspace.xml
File  17790  2020-05-19  10:59  lianjia\MyData.json
Dir       0  2020-05-09  16:19  lianjia\venv\
Dir       0  2020-05-09  11:02  lianjia\venv\Include\
Dir       0  2020-05-09  16:17  lianjia\venv\Lib\
Dir       0  2020-05-09  16:19  lianjia\venv\Lib\site-packages\
Dir       0  2020-05-09  16:17  lianjia\venv\Lib\site-packages\attr\
Dir       0  2020-05-09  16:17  lianjia\venv\Lib\site-packages\attrs-19.3.0.dist-info\
File      4  2020-05-09  11:19  lianjia\venv\Lib\site-packages\attrs-19.3.0.dist-info\INSTALLER
File   1082  2020-05-09  11:19  lianjia\venv\Lib\site-packages\attrs-19.3.0.dist-info\LICENSE
File   9022  2020-05-09  11:19  lianjia\venv\Lib\site-packages\attrs-19.3.0.dist-info\METADATA
File   2184  2020-05-09  11:19  lianjia\venv\Lib\site-packages\attrs-19.3.0.dist-info\RECORD
File      5  2020-05-09  11:19  lianjia\venv\Lib\site-packages\attrs-19.3.0.dist-info\top_level.txt
File    110  2020-05-09  11:19  lianjia\venv\Lib\site-packages\attrs-19.3.0.dist-info\WHEEL
File   2141  2020-05-09  11:19  lianjia\venv\Lib\site-packages\attr\converters.py
File    351  2020-05-09  11:19  lianjia\venv\Lib\site-packages\attr\converters.pyi
File   1635  2020-05-09  11:19  lianjia\venv\Lib\site-packages\attr\exceptions.py
File    458  2020-05-09  11:19  lianjia\venv\Lib\site-packages\attr\exceptions.pyi
File   1098  2020-05-09  11:19  lianjia\venv\Lib\site-packages\attr\filters.py
File    214  2020-05-09  11:19  lianjia\venv\Lib\site-packages\attr\filters.pyi
File      0  2020-05-09  11:19  lianjia\venv\Lib\site-packages\attr\py.typed
File  11460  2020-05-09  11:19  lianjia\venv\Lib\site-packages\attr\validators.py
File   1868  2020-05-09  11:19  lianjia\venv\Lib\site-packages\attr\validators.pyi
File   7326  2020-05-09  11:19  lianjia\venv\Lib\site-packages\attr\_compat.py
............ (4028 further file entries omitted here)
import logging
import re
from collections import namedtuple
from datetime import time

import six
from six.moves.urllib.parse import (ParseResult, quote, urlparse,
                                    urlunparse)

logger = logging.getLogger(__name__)

_Rule = namedtuple('Rule', ['field', 'value'])
RequestRate = namedtuple(
    'RequestRate', ['requests', 'seconds', 'start_time', 'end_time'])

# Directive names, including common misspellings, accepted by the parser.
_DISALLOW_DIRECTIVE = {'disallow', 'dissallow', 'dissalow', 'disalow', 'diasllow', 'disallaw'}
_ALLOW_DIRECTIVE = {'allow'}
_USER_AGENT_DIRECTIVE = {'user-agent', 'useragent', 'user agent'}
_SITEMAP_DIRECTIVE = {'sitemap', 'sitemaps', 'site-map'}
_CRAWL_DELAY_DIRECTIVE = {'crawl-delay', 'crawl delay'}
_REQUEST_RATE_DIRECTIVE = {'request-rate', 'request rate'}
_HOST_DIRECTIVE = {'host'}

_WILDCARDS = {'*', '$'}

_HEX_DIGITS = set('0123456789ABCDEFabcdef')

__all__ = ['RequestRate', 'Protego']


def _is_valid_directive_field(field):
    return any([field in _DISALLOW_DIRECTIVE,
                field in _ALLOW_DIRECTIVE,
                field in _USER_AGENT_DIRECTIVE,
                field in _SITEMAP_DIRECTIVE,
                field in _CRAWL_DELAY_DIRECTIVE,
                field in _REQUEST_RATE_DIRECTIVE,
                field in _HOST_DIRECTIVE])


def _enforce_path(pattern):
    if pattern.startswith('/'):
        return pattern
    return '/' + pattern


class _URLPattern(object):
    """Internal class which represents a URL pattern."""

    def __init__(self, pattern):
        self._pattern = pattern
        self.priority = len(pattern)
        self._contains_asterisk = '*' in self._pattern
        self._contains_dollar = self._pattern.endswith('$')

        if self._contains_asterisk:
            self._pattern_before_asterisk = self._pattern[:self._pattern.find('*')]
        elif self._contains_dollar:
            self._pattern_before_dollar = self._pattern[:-1]

        self._pattern_compiled = False

    def match(self, url):
        """Return True if pattern matches the given URL, otherwise return False."""
        # check if pattern is already compiled
        if self._pattern_compiled:
            return self._pattern.match(url)

        if not self._contains_asterisk:
            if not self._contains_dollar:
                # answer directly for patterns without wildcards
                return url.startswith(self._pattern)

            # pattern only contains $ wildcard.
            return url == self._pattern_before_dollar

        if not url.startswith(self._pattern_before_asterisk):
            return False

        # compile lazily, the first time a wildcard pattern is actually needed
        self._pattern = self._prepare_pattern_for_regex(self._pattern)
        self._pattern = re.compile(self._pattern)
        self._pattern_compiled = True
        return self._pattern.match(url)

    def _prepare_pattern_for_regex(self, pattern):
        """Return equivalent regex pattern for the given URL pattern."""
        pattern = re.sub(r'\*+', '*', pattern)
        s = re.split(r'(\*|\$$)', pattern)
        for index, substr in
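
The snippet above breaks off just as it starts translating a robots.txt path pattern into a regular expression. The general idea can be sketched independently as follows. This is a standalone illustration written for this page, not a reconstruction of the missing lines; the function name `robots_pattern_to_regex` is hypothetical.

```python
import re


def robots_pattern_to_regex(pattern):
    """Translate a robots.txt path pattern into a compiled regex.

    '*' matches any run of characters and a trailing '$' anchors the
    end of the URL; everything else is matched literally.
    """
    # Collapse runs of '*' so that '**' behaves like a single wildcard.
    pattern = re.sub(r'\*+', '*', pattern)
    # Split on '*' and on a trailing '$'. Because of the capturing group,
    # the result alternates: literal text at even indexes, wildcards at odd.
    parts = re.split(r'(\*|\$$)', pattern)
    pieces = []
    for index, part in enumerate(parts):
        if index % 2 == 1:
            # Wildcard token: '*' becomes a lazy "anything", '$' stays an anchor.
            pieces.append('.*?' if part == '*' else '$')
        else:
            # Literal text: escape regex metacharacters such as '.' and '?'.
            pieces.append(re.escape(part))
    return re.compile(''.join(pieces))
```

For example, `robots_pattern_to_regex('/a*b$').match(url)` succeeds for `/axxb` but not for `/axxbc`, while a pattern without wildcards behaves as a plain prefix match.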